The STRING database is a protein-protein interaction prediction database. Its predictions draw on a number of data sources that we do not have access to. Until we do, we have to make do with the reduced information that can be extracted from the publicly available files on the STRING website downloads page. Specifically, we will use the detailed predictions file for Homo sapiens:
In [3]:
cd ../../
In [4]:
!mkdir string
In [3]:
cd string/
In [6]:
!wget http://string-db.org/newstring_download/protein.links.detailed.v9.1/9606.protein.links.detailed.v9.1.txt.gz
In [8]:
!gunzip 9606.protein.links.detailed.v9.1.txt.gz
In [12]:
!head 9606.protein.links.detailed.v9.1.txt
By inspection we can see that the protein identifiers here are Ensembl protein IDs. In the InterologWalk Notebook a dictionary was saved to map between our Entrez IDs and these IDs. We can reuse this dictionary:
In [4]:
cd ../geneconversion/
In [5]:
import pickle
In [6]:
# open in binary mode, which is the safe choice for pickle files
f = open("human.gene2ensemble.pickle", "rb")
gene2ensembl = pickle.load(f)
f.close()
To map from the above Ensembl IDs to Entrez IDs, the dictionary will have to be inverted:
In [7]:
ensembl2gene = {}
for k in gene2ensembl:
    for p in gene2ensembl[k]:
        # append k to the list for p, creating the list the first time p is seen;
        # this avoids overwriting earlier entries when only some p are new
        ensembl2gene.setdefault(p, []).append(k)
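As a quick sanity check, the inversion of a one-to-many dictionary can be tried on a toy mapping. This is only a sketch: the names `toy_gene2ensembl` and `toy_ensembl2gene` and the IDs in them are made up for illustration and are not taken from the real pickle file.

```python
# Toy stand-in for gene2ensembl: Entrez gene ID -> list of Ensembl protein IDs.
# These IDs are invented for illustration only.
toy_gene2ensembl = {
    "1": ["ENSP_A", "ENSP_B"],
    "2": ["ENSP_B"],
}

toy_ensembl2gene = {}
# sorted() only makes the order of the resulting lists deterministic
for gene in sorted(toy_gene2ensembl):
    for protein in toy_gene2ensembl[gene]:
        # create the list on first sight of the protein, then append
        toy_ensembl2gene.setdefault(protein, []).append(gene)

# ENSP_B is shared by both genes, so it maps back to both of them
print(toy_ensembl2gene["ENSP_A"])
print(toy_ensembl2gene["ENSP_B"])
```

The point to notice is that a protein ID shared by several genes ends up with a list of all of them, which is why the inverted dictionary is not one-to-one.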
What we would like to do is create a class which stores each protein pair as a key and the corresponding feature vector as its value. If it is unable to retrieve a feature vector for a pair then it should return an empty vector, as that would correspond to each of the evidence terms being zero.
The mapping between Ensembl and Entrez IDs is not one-to-one, so a single Ensembl ID may map to several Entrez IDs. To ensure coverage, every combination of the Entrez IDs that a protein pair maps to must point to the same feature vector.
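The behaviour described above can be sketched with a small class. This is a hypothetical illustration, not the implementation of `ocbio.ppipred.features`: the class name `PairFeatures`, its arguments, and the choice to represent the "empty" vector as a list of `"0"` strings of fixed length are all assumptions made for the example.

```python
class PairFeatures(object):
    """Hypothetical sketch: look up a feature vector by unordered ID pair."""

    def __init__(self, pairdict, length):
        # pairdict maps frozenset([id1, id2]) -> feature vector
        self.pairdict = pairdict
        # assumed default: every evidence term is zero
        self.default = ["0"] * length

    def __getitem__(self, pair):
        # frozenset makes the lookup independent of the order of the pair
        return self.pairdict.get(frozenset(pair), self.default)


# Made-up entry for illustration only
toy = PairFeatures({frozenset(["1", "2"]): ["0", "0", "150"]}, 3)
known = toy[("2", "1")]    # stored pair, queried in the opposite order
unknown = toy[("1", "3")]  # no entry, so the default vector is returned
```

Keying on a `frozenset` rather than a tuple means that `("1", "2")` and `("2", "1")` refer to the same entry, which suits an undirected interaction.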
In [8]:
cd ../string/
In [9]:
import csv
In [10]:
import itertools
In [11]:
f = open("9606.protein.links.detailed.v9.1.txt")
c = csv.reader(f, delimiter=" ")
# skip the header row
next(c)
stringdict = {}
# iterate over rows building dictionary:
for l in c:
    # first build the (possibly various) keys;
    # IDs look like <taxon>.<Ensembl protein ID>, so keep the part after the dot
    try:
        geneids1 = ensembl2gene[l[0].split(".")[1]]
        geneids2 = ensembl2gene[l[1].split(".")[1]]
    except KeyError:
        # give up on pair if they can't be mapped to Entrez
        continue
    # then iterate over their combinations, saving the feature vector for each entry
    for i1, i2 in itertools.product(geneids1, geneids2):
        stringdict[frozenset([i1, i2])] = l[2:]
f.close()
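To see how one row of the file can fan out into several dictionary keys, the same steps can be run on a single made-up row. This is a sketch only: `toy_ensembl2gene`, `row` and the values in them are invented for illustration and do not come from the STRING file.

```python
import itertools

# Made-up mapping and row, for illustration only
toy_ensembl2gene = {"ENSP_A": ["1"], "ENSP_B": ["2", "3"]}
row = ["9606.ENSP_A", "9606.ENSP_B", "0", "0", "120"]

toy_stringdict = {}
# strip the taxon prefix and map each Ensembl ID to its Entrez IDs
geneids1 = toy_ensembl2gene[row[0].split(".")[1]]
geneids2 = toy_ensembl2gene[row[1].split(".")[1]]
# every combination of Entrez IDs gets the same feature vector
for i1, i2 in itertools.product(geneids1, geneids2):
    toy_stringdict[frozenset([i1, i2])] = row[2:]

for key in sorted(toy_stringdict, key=sorted):
    print("%s -> %s" % (sorted(key), toy_stringdict[key]))
```

Because the second protein maps to two Entrez IDs, the single row produces two keys, and both share the feature vector taken from the remaining columns.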
In [12]:
import sys
In [13]:
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
In [14]:
import ocbio.ppipred
In [15]:
strfeatures = ocbio.ppipred.features(stringdict,stringdict.values()[0])
In [16]:
import pickle
In [18]:
f = open("human.Entrez.string.pickle","wb")
pickle.dump(strfeatures,f)
f.close()